ST590 Project 3

Jovanni Catalan & Sergio Mora

Introduction

You should discuss the goals of the notebook, introduce your data set, and give the source for your data set

The goal of this notebook is to build a clear understanding of obesity levels in Mexico, Peru, and Colombia from the metrics collected. The data comes from the UCI Machine Learning Repository, which sourced it from the article "Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico".

The data set has 17 columns and 2,111 observations. The columns are:

Gender: Patient's gender {object}

Age: Patient's age {float}

Height: Patient's height {float}

Weight: Patient's weight {float}

family_history_with_overweight: Whether the patient has a family history of overweight {object}

FAVC: Frequent consumption of high-caloric food {object}

FCVC: Frequent consumption of vegetables {float}. This differs from the article and needs review.

NCP: Number of main meals the patient has daily {float}. This differs from the article and needs review.

CAEC: Consumption of food between meals {object}

SMOKE: Does the patient smoke {object}

CH2O: Consumption of water in liters {float}. This differs from the article and needs review.

SCC: Does the patient monitor the calories they consume {object}

FAF: How often the patient engages in physical activity {float}. This differs from the article and needs review; there is also some overlap.

TUE: How often the patient uses technological devices (e.g. phones, video games, TVs, computers) {float}. This differs from the article and needs review.

CALC: How often the patient drinks alcohol {object}

MTRANS: What type of transportation does the patient normally use {object}

NObeyesdad: Patient's weight status {object}

Obesity levels defined as:

Supervised Learning Idea and Data Split

Give a discussion of what we are generally trying to do with supervised learning where prediction is our goal. Discuss why we want to split our data into a training and test set.

You should also split the data into a training and test set
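As a minimal sketch of the idea behind a train/test split, the snippet below shuffles row indices and holds out a fraction for testing. The 80/20 proportion, the seed, and the helper name `train_test_split` are assumptions for illustration, not choices stated in this notebook. In pyspark the analogous call would typically be `DataFrame.randomSplit([0.8, 0.2], seed=...)`, which returns splits of only approximately those sizes.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=590):
    """Shuffle a copy of rows and return (train, test) lists.

    test_fraction and seed are illustrative assumptions.
    """
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the 2,111 observations in the data set.
train, test = train_test_split(range(2111))
print(len(train), len(test))
```

The test rows are set aside before any EDA or model fitting so that they can give an honest estimate of how the chosen model predicts on unseen data.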

EDA

You should have a narrative that goes through what you are trying to accomplish in the EDA, why you are looking at a particular graph or statistic, and how you interpret what you’ve made. The EDA should be done on the training data only. You should use pandas-on-spark or spark SQL data frames (but matplotlib is fine)

Part of the final’s purpose is to see if you can judge what should and shouldn’t be included in an EDA.

Gender

We make no real assumption here before observing the data, as we have no reason to believe that either gender would be more likely to face obesity than the other.

Although the data set is not huge, the results are still hard to read in this format. The visualization below should help.

We see that overall our data is very evenly split when it comes to gender. This shouldn't come as a surprise to us. Further analysis should show if there is a correlation between gender and obesity rate.

Immediately we start to see some interesting features of our data. We see the following:

Insufficient Weight: There are more women than men who are of insufficient weight. This could have multiple causes, but one that comes to mind is the pressure on young women to be thin.

Normal Weight: This is evenly split.

Obesity Type I: This is skewed male but not overly so.

Obesity Type II: This is predominantly male. This could be attributed to the way BMI is measured, using only weight and height without accounting for the higher-than-average muscle mass that many young men tend to have.

Obesity Type III: This is, surprisingly, almost entirely female. Because this is measured using BMI, it stands to reason that if a man and a woman weigh the same, the woman would likely have the higher BMI due to either height differences or assumed muscle mass differences.

Overweight Level I and II: These categories seem to be fairly evenly split between genders, with males skewing slightly heavier.
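The height argument above can be checked with a short sketch of the standard BMI formula, weight in kilograms divided by height in meters squared. The function name `bmi` and the two example people are made up for illustration, and the sketch assumes the Weight and Height columns are recorded in kilograms and meters.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


# Two hypothetical people with the same weight but different heights.
shorter = bmi(80.0, 1.60)
taller = bmi(80.0, 1.75)

# The shorter person has the higher BMI at equal weight.
print(round(shorter, 2), round(taller, 2))
```

Because BMI uses only these two inputs, it cannot distinguish muscle mass from fat, which is the caveat raised for the Obesity Type II group.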

Smoke

An assumption made here is that smoking would correlate with being overweight or obese. The idea is that one bad habit could lead to another, and that smokers are less healthy because they smoke and thus might exercise less.

We see that the vast majority of people in our data set do not smoke. Because of the roughly 49:1 ratio of non-smokers to smokers, further analysis of this variable would be hard to visualize without accounting for that imbalance.
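One way to account for such an imbalance is to compare the share of smokers within each weight category instead of raw counts. The sketch below is only an illustration: the helper name `smoker_share_by_level` and the toy records are made up and are not values from the data set.

```python
from collections import Counter, defaultdict


def smoker_share_by_level(records):
    """Return the fraction of smokers within each obesity level.

    records: iterable of (obesity_level, smokes) pairs, smokes is 'yes' or 'no'.
    """
    counts = defaultdict(Counter)
    for level, smokes in records:
        counts[level][smokes] += 1
    return {level: c["yes"] / sum(c.values()) for level, c in counts.items()}


# Made-up toy records, NOT taken from the real data.
toy = (
    [("Normal_Weight", "no")] * 48
    + [("Normal_Weight", "yes")] * 2
    + [("Obesity_Type_I", "no")] * 45
    + [("Obesity_Type_I", "yes")] * 5
)
shares = smoker_share_by_level(toy)
print(shares)
```

Within-group shares put every category on the same 0 to 1 scale, so a rare level such as "yes" stays visible next to the dominant "no".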

Age

There are two opposing thoughts here. Younger people might be more fit due to their age and potentially being more active. On the other hand, obese people may be less likely to reach older ages, which could skew the data.

We see that our data is right-skewed, with a few records showing people in their 40s, 50s, and even 60s. We also see that obesity and overweight might be correlated with age, since our Insufficient_Weight and Normal_Weight groups are both on the younger side of the distribution compared to the other groups.

Because our data is skewed, the visual above does not give much insight into how obese people fare later in life. However, a correlation test below will help with this.
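As a rough sketch of what such a correlation check computes, the snippet below implements the Pearson correlation coefficient by hand. The ages and BMI values are made up for illustration, and using BMI as a numeric stand-in for the categorical obesity level is an assumption of this sketch. On a Spark data frame, `DataFrame.corr(col1, col2)` would be the usual way to get the same statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only, NOT taken from the data set.
ages = [18, 21, 23, 30, 41]
bmis = [19.5, 22.0, 24.5, 29.0, 33.0]
r = pearson(ages, bmis)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; it says nothing about causation.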

Family History

"Wealth begets wealth" is a common saying, meaning that wealth brings forth more wealth, as in wealthy parents raising a child who in turn becomes wealthy. Family history might tell us a lot about someone's likelihood of being obese. Here we explore whether "obesity begets obesity".

We see that people with a family history of overweight seem more likely to be overweight or obese themselves. This seems especially true for people who are obese rather than overweight.

Transportation Method

From both the bar chart above and the cross table, there appears to be some association between how people get around and their weight; for example, many people who walk are in the normal weight category. A simple linear model could describe the relationship better, but for now all we know is that further analysis is needed.

Modeling

Next, you should fit three different classes of models (they can be the ones we did in class or you can branch out). You can have a numeric response or a binary response.

With each model type you use, you should describe the general idea of the model/how it works. These discussions don’t need to be super long, but they should be clear and hit on the most important points about how the model works.

You should use CV to choose among the candidate models for each model type.

• You should set up a pipeline in pyspark for each of your models

• At least one of the pipelines should include at least two transformations prior to the model fit (estimator)

• You can use the same set of transformations for multiple models (if appropriate)
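To make the cross-validation requirement concrete, here is a minimal sketch of how k-fold CV partitions the training rows: each fold serves once as the validation set while the remaining rows are used for fitting. The helper name `k_fold_indices` and the choice of 3 folds over 10 rows are assumptions for illustration. In pyspark, `CrossValidator` from `pyspark.ml.tuning` performs this splitting internally when given a pipeline, a parameter grid, and an evaluator.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (training_indices, validation_indices) for each of k folds.

    Rows are assumed to be shuffled already; folds are contiguous blocks.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_rows))
        yield training, validation
        start += size


folds = list(k_fold_indices(10, k=3))
for training, validation in folds:
    print(len(training), validation)
```

For each candidate model, the metric is averaged over the folds, and the candidate with the best average is refit on the full training set before being compared on the test set.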